Abstract

Multitask learning algorithms are typically designed assuming
some fixed, a priori known latent structure shared by all the
tasks. However, it is usually unclear what type of latent task
structure is the most appropriate for a given multitask learning
problem. Ideally, the “right” latent task structure should be
learned in a data-driven manner. We present a flexible,
nonparametric Bayesian model that posits a mixture of factor
analyzers structure on the tasks. The nonparametric aspect makes
the model expressive enough to subsume many existing models of
latent task structures (e.g., mean-regularized tasks, clustered
tasks, low-rank or linear/non-linear subspace assumption on
tasks, etc.). Moreover, it can also learn more general task
structures, addressing the shortcomings of such models. We
present a variational inference algorithm for our model.
Experimental results on synthetic and real-world datasets, on
both regression and classification problems, demonstrate the
effectiveness of the proposed method.

1.  Introduction 

Learning problems do not exist in a vacuum. Often one is tasked
with developing not one, but many classifiers for different
tasks. In these cases, there is often not enough data to learn a
good model for each task individually; real-world examples
include prioritizing email messages across many users’ inboxes
(Aberdeen et al., 2011) and recommending items to users on web
sites (Ning & Karypis, 2010). In these settings it is
advantageous to transfer or share information across tasks.
Multitask learning (MTL) (Caruana, 1997) encompasses a range of
techniques to share statistical strength across models for
various tasks and allows learning even when the amount of labeled
data for each individual task is very small. Most MTL methods
achieve this improved performance either by assuming some notion
of similarity across tasks, for example that all task parameters
are drawn from a shared Gaussian prior (Chelba & Acero, 2006),
have a cluster structure (Xue et al., 2007; Jacob & Bach, 2008),
live on a low-dimensional subspace (Rai & Daumé III, 2010), or
share feature representations (Argyriou et al., 2007), or by
modeling the task covariance matrix (Bonilla et al., 2007; Zhang
& Yeung, 2010). Choosing the correct notion of task relatedness
is crucial to the effectiveness of any MTL method. Incorrect
assumptions can hurt performance, so it is desirable to have a
flexible model that can automatically adapt its assumptions to a
given problem.

Motivated by this, we propose a nonparametric Bayesian MTL model
that represents the task parameters (e.g., the weight vectors for
logistic regression models) as being generated from a
nonparametric mixture of nonparametric factor analyzers.
Parameters are shared only between tasks in the same cluster and,
within each cluster, across a linear subspace that regularizes
what is shared. Moreover, because the model is nonparametric,
various existing MTL models arise as special cases of our model:
for example, models in which the weight vectors are drawn from a
single shared Gaussian prior, or form clusters (equivalently, are
generated from a mixture of Gaussians), or live close to a
subspace. Our model can automatically interpolate between these
assumptions as needed, providing the best fit to the given MTL
problem.

In addition to offering a general framework for multitask
learning, our proposed model also addresses several shortcomings
of commonly used MTL models. For example, task clustering (Xue et
al., 2007), which fits a full-covariance Gaussian mixture model
over the weight vectors, is prone to overfitting on
high-dimensional problems because the number of learning tasks is
usually much smaller than the dimensionality, making it difficult
to estimate the covariance matrix. A model based on mixtures of
factor analyzers, like ours, can deal with this issue by
adaptively estimating the dimensionality of each component, using
fewer parameters than in the full-rank case. Likewise, models
based on task subspaces (Zhang et al., 2006; Rai & Daumé III,
2010; Agarwal et al., 2010) assume that the weight vectors of all
the tasks live on or close to a single shared subspace, which is
known to lead to negative transfer in the presence of outlier
tasks. Our model, based on a mixture of subspaces, circumvents
these issues by allowing different groups of weight vectors to
live in different subspaces when grouping them all together would
not fit the data well. One can also view our model as allowing
the sharing of statistical strength at two levels: (1) by
exploiting the cluster structure, and (2) by additionally
exploiting the subspace structure within each cluster.

2.  Background 

In the context of MTL, since the task relatedness structure is
usually unknown, the standard solution is to try many different
models, covering many similarity assumptions, with many settings
of complexity for each model, and to choose one according to some
model selection criterion. In this paper, we take a nonparametric
Bayesian approach to this problem (using the Dirichlet Process
and the Indian Buffet Process as building blocks) such that both
the MTL model capturing the correct task relatedness structure
and the complexity of that model are learned in a data-driven
manner, side-stepping the model selection issues.

2.1.  The  Dirichlet  Process 

The Dirichlet Process (DP) is a prior distribution over discrete
distributions (Ferguson, 1973). Discreteness implies that if one
draws samples from a distribution drawn from the DP, the samples
will cluster: new samples take the same value as older samples
with some positive probability. A DP is defined by two
parameters: a concentration parameter $\alpha$ and a base measure
$G_0$. The sampling process defining the DP draws the first
sample from the base measure $G_0$. Each subsequent sample takes
on a new value drawn from $G_0$ with probability proportional to
$\alpha$, or reuses a previously drawn value with probability
proportional to the number of samples having that value. This
property makes it suitable as a prior for effectively infinite
mixture models, where the number of mixture components can grow
as new samples are observed. Our mixture of factor analyzers
based MTL model uses the DP to model the mixture components so we
do not need to specify their number a priori.
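The sampling process described above can be sketched in a few lines. This is only an illustrative sketch, not the inference procedure used in this paper: the function name `sample_dp_sequence`, the standard normal base measure, and the value of the concentration parameter are assumptions made for the example.

```python
import random


def sample_dp_sequence(n, alpha, base_draw, rng):
    """Draw n samples whose clustering follows the DP sampling process."""
    values = []       # distinct values drawn from the base measure so far
    counts = []       # number of samples sharing each distinct value
    assignments = []  # index into `values` for every sample
    for i in range(n):
        # After i samples: a new value has weight alpha, and each
        # existing value k has weight counts[k].
        u = rng.random() * (i + alpha)
        if u < alpha:
            values.append(base_draw(rng))
            counts.append(1)
            assignments.append(len(values) - 1)
        else:
            u -= alpha
            k = 0
            while k < len(counts) - 1 and u >= counts[k]:
                u -= counts[k]
                k += 1
            counts[k] += 1
            assignments.append(k)
    return values, counts, assignments


rng = random.Random(0)
values, counts, assignments = sample_dp_sequence(
    n=50, alpha=1.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
print(len(values), "distinct values among", len(assignments), "samples")
```

The number of distinct values is not fixed in advance; it grows slowly with the number of samples and with the concentration parameter, which is the property the mixture model relies on.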

2.2.  The  Indian  Buffet  Process 

The Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2006)
and the closely related Beta Process (Thibaux & Jordan, 2007)
define a distribution on a collection of sparse binary vectors of
unbounded size (or, equivalently, on sparse binary matrices with
one dimension fixed but the other being unbounded). Such sparse
structures are commonly used in applications such as sparse
factor analysis (Paisley & Carin, 2009) where we want to
decompose a data matrix $X$ such that each observation
$X_n \in \mathbb{R}^D$ is represented as a sparse combination of
a set of $K \ll D$ basis vectors (or factors) but $K$ is not
specified a priori. The generative story in the finite case is
(assuming a linear Gaussian generation model):

$$X_n \sim \mathrm{Nor}(A b_n, \sigma_x^2 I)$$
$$A_k \sim \mathrm{Nor}(0, \sigma^2 I)$$
$$b_{kn} \sim \mathrm{Ber}(\pi_k)$$
$$\pi_k \sim \mathrm{Bet}(\alpha/K, 1)$$

In the above, $A$ is a matrix consisting of $K$ columns (the
factors) and the factor combination is defined by the sparse
binary vector $b_n$ of size $K$. For the more general case of
factor analysis, factor combination weights are sparse
real-valued vectors, so the model is of the form
$X_n = A(s_n \odot b_n) + E$, where $s_n$ is a real-valued vector
of the same size as $b_n$ (Paisley & Carin, 2009) and can be
given a Gaussian prior, and $\odot$ is the elementwise product.
Our mixture of factor analyzers based MTL model uses the IBP/Beta
Process to model each factor analyzer so we do not need to
specify the number of factors $K$ a priori.
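The finite generative story above can be simulated directly. The sketch below is a minimal illustration under stated assumptions: the function name `sample_sparse_factor_data`, the argument `sigma_a` (standing for the standard deviation of the factor prior), and all numeric settings are chosen for the example and are not taken from the paper.

```python
import random


def sample_sparse_factor_data(n, d, k, alpha, sigma_x, sigma_a, rng):
    """Sample from the finite Beta-Bernoulli sparse factor model."""
    # pi_k ~ Bet(alpha / K, 1): probability that factor k is switched on
    pi = [rng.betavariate(alpha / k, 1.0) for _ in range(k)]
    # A is a D x K matrix whose columns are the Gaussian factors
    A = [[rng.gauss(0.0, sigma_a) for _ in range(k)] for _ in range(d)]
    B, X = [], []
    for _ in range(n):
        # b_kn ~ Ber(pi_k): sparse binary selection of factors
        b = [1 if rng.random() < pi[j] else 0 for j in range(k)]
        # X_n ~ Nor(A b_n, sigma_x^2 I)
        x = [sum(A[i][j] * b[j] for j in range(k)) + rng.gauss(0.0, sigma_x)
             for i in range(d)]
        B.append(b)
        X.append(x)
    return pi, A, B, X


pi, A, B, X = sample_sparse_factor_data(
    n=8, d=5, k=10, alpha=2.0, sigma_x=0.1, sigma_a=1.0,
    rng=random.Random(1))
print("active factors per observation:", [sum(b) for b in B])
```

With a large truncation level K, most entries of `pi` are close to zero, so each observation uses only a few factors; this is the sense in which the finite model approximates the IBP.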

3.  Mixture  of  Factor  Analyzers  based 
Generative  Model  for  MTL 

Our proposed model assumes that the parameters (i.e., the weight
vector) of each task are sampled from a mixture of factor
analyzers (Ghahramani & Beal, 2000). Note that our model is
defined over latent weight vectors whereas the standard mixture
of factor analyzers is commonly defined to model observed data.

Figure 1. A graphical depiction of our model. The task parameters
$\theta$ are sampled from a DP-IBP mixture and used to generate
the Y values.

We assume that we are learning $T$ related tasks, where each task
is represented by a weight vector $\theta_t \in \mathbb{R}^D$
that is assumed to be sampled from a mixture of $F$ factor
analyzers, where each factor analyzer consists of
$K \le \min\{T, D\}$ factors (note: our model also allows each
factor analyzer to have a different number of factors). Here $D$
denotes the number of features in the data. Each task is a set of
X and Y values, and each Y is assumed to be generated from the
corresponding X value and task weight vector. In our model, the
weight vector $\theta_t$ for task $t$ is generated by first
sampling a factor analyzer (defined by a mean task parameter
$\mu_t \in \mathbb{R}^D$ and a factor loading matrix
$\Lambda_t \in \mathbb{R}^{D \times K}$) using the DP, and then
generating $\theta_t$ using that factor analyzer. In equations,
this can be written as
$\theta_t = \mu_t + \Lambda_t f_t + \epsilon_t$.

The weight vector $\theta_t$ is a sparse linear combination of
$K$ basis vectors represented by the columns of $\Lambda_t$ (each
column is a “basis task”). The combination weights are given by
$f_t \in \mathbb{R}^K$, which we represent as $s_t \odot b_t$,
where $s_t$ is a real-valued vector and $b_t$ is a binary-valued
vector, both of size $K$. Our model uses a Beta-Bernoulli/IBP
prior on $b_t$ to determine $K$, the number of factors in each
factor analyzer. The $\{\mu_t, \Lambda_t\}$ pair for each task is
drawn from a DP, which also gives the tasks a clustering
property, so there will be a finite number $F \le T$ of distinct
factor analyzers. Finally,
$\epsilon_t \sim \mathrm{Nor}(0, \sigma^{-1} I)$ represents
task-specific noise, where $\sigma$ is a precision parameter.

Figure 1 shows a graphical depiction of our model and Figure 2
shows the generative story for the linear regression case. The DP
base measure $G_0$ is a product of two Gaussian priors for
$\mu_t$ and $\Lambda_t$. In our nonparametric Bayesian model, $F$
and $K$ need not be known a priori; these are inferred from the
data.

Figure 2. The hierarchical model. The indicator variable $z$ of
Fig. 1 is implicit in the draw from the DP. The Beta-Bernoulli
draw for $b_{kt}$ approximates the IBP for large $K$ (actual $K$
will be inferred from the data).

For classification, the only change is that the first line in the
generative model becomes
$Y_{t,i} \sim \mathrm{Ber}(\mathrm{sig}(\theta_t^\top X_{t,i}))$,
where $\mathrm{sig}(x) = \frac{1}{1 + \exp(-x)}$ is the logistic
function and Ber is the Bernoulli distribution.
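To make the generative story concrete, the following sketch samples task weight vectors and labels from a truncated version of the model. It is a rough illustration under several assumptions that are not stated in the text: the function name `sample_tasks`, unit-variance Gaussian priors for the means, loadings, weights, and inputs, one set of Beta-Bernoulli probabilities per factor analyzer, `alpha1` as the DP concentration, `alpha2` as the IBP parameter, and `sigma` treated as a noise precision.

```python
import math
import random


def sigmoid(a):
    # Numerically stable logistic function 1 / (1 + exp(-a))
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def sample_tasks(T, D, F, K, alpha1, alpha2, sigma, n_per_task, rng,
                 classification=False):
    # Truncated stick-breaking weights over F factor analyzers
    phi = [rng.betavariate(1.0, alpha1) for _ in range(F - 1)] + [1.0]
    weights, remaining = [], 1.0
    for p in phi:
        weights.append(p * remaining)
        remaining *= 1.0 - p
    # Per-analyzer mean, loading matrix, and factor activation probabilities
    mus = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(F)]
    lams = [[[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(D)]
            for _ in range(F)]
    pis = [[rng.betavariate(alpha2 / K, 1.0) for _ in range(K)]
           for _ in range(F)]
    noise_sd = 1.0 / math.sqrt(sigma)  # sigma is treated as a precision
    tasks = []
    for _ in range(T):
        z = rng.choices(range(F), weights=weights)[0]
        s = [rng.gauss(0.0, 1.0) for _ in range(K)]
        b = [1 if rng.random() < pis[z][k] else 0 for k in range(K)]
        # theta_t = mu + Lambda (s * b) + noise
        theta = [mus[z][d]
                 + sum(lams[z][d][k] * s[k] * b[k] for k in range(K))
                 + rng.gauss(0.0, noise_sd)
                 for d in range(D)]
        X = [[rng.gauss(0.0, 1.0) for _ in range(D)]
             for _ in range(n_per_task)]
        Y = []
        for x in X:
            mean = sum(t * xi for t, xi in zip(theta, x))
            if classification:
                Y.append(1 if rng.random() < sigmoid(mean) else 0)
            else:
                Y.append(rng.gauss(mean, 1.0))
        tasks.append({"z": z, "theta": theta, "X": X, "Y": Y})
    return tasks


tasks = sample_tasks(T=6, D=4, F=3, K=2, alpha1=1.0, alpha2=5.0, sigma=4.0,
                     n_per_task=5, rng=random.Random(0), classification=True)
print([t["z"] for t in tasks])
```

Tasks that draw the same indicator share a mean and a loading matrix, so their weight vectors lie near the same low-dimensional affine subspace.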

A number of existing multitask learning models arise as special
cases of our model, which interpolates between the following
scenarios depending on the values of $F$ and $K$ inferred for a
given multitask learning dataset:

• Shared Gaussian Prior ($F = 1$, $K = 0$): (Chelba & Acero,
2006). This corresponds to a single factor analyzer modeling
either a diagonal or full-rank Gaussian as the prior.

• Cluster-based Assumption ($F > 1$, $K = 0$): (Xue et al., 2007;
Jacob & Bach, 2008). This corresponds to a mixture of
identity-covariance or full-rank Gaussians as the prior.

• Linear Subspace Assumption ($F = 1$, $K < D$): (Zhang et al.,
2006; Rai & Daumé III, 2010). This corresponds to a single factor
analyzer with less than full rank. Note that this is also
equivalent to the matrix
$\Theta = \{\theta_1, \ldots, \theta_T\}$ being a rank-$K$ matrix
(Argyriou et al., 2007).

• Nonlinear Manifold Assumption: A mixture of linear subspaces
allows modeling a nonlinear subspace (Chen et al., 2010) and can
capture the case when the weight vectors live on a nonlinear
manifold (Ghosn & Bengio, 2003; Agarwal et al., 2010). Moreover,
in our model, the manifold’s intrinsic dimensionality can be
different in different parts of the ambient space (since we do
not restrict $K$ to be the same for each factor analyzer).
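One way to see why these cases fall out of the model is to look at the distribution of a weight vector within a single factor analyzer. The short derivation below is a sketch under two assumptions that go beyond the text: $s_t \sim \mathrm{Nor}(0, I)$, and task-specific noise with covariance $\sigma^{-1} I$.

```latex
% Conditioned on the factor analyzer (\mu, \Lambda) and the binary mask b_t:
\theta_t = \mu + \Lambda (s_t \odot b_t) + \epsilon_t
         = \mu + \Lambda \,\mathrm{diag}(b_t)\, s_t + \epsilon_t
% Integrating out s_t and \epsilon_t, and using diag(b_t)^2 = diag(b_t):
\theta_t \mid \mu, \Lambda, b_t \sim
  \mathrm{Nor}\!\left(\mu,\;
    \Lambda \,\mathrm{diag}(b_t)\, \Lambda^\top + \sigma^{-1} I\right)
% K = 0 (no active factors): covariance \sigma^{-1} I around the mean \mu.
% K < D active factors: a rank-K term plus spherical noise, so \theta_t
% concentrates near a K-dimensional affine subspace through \mu.
```

Under these assumptions, a single analyzer with no active factors behaves like a shared Gaussian prior, several such analyzers behave like a mixture of Gaussians, and active factors add a low-rank component to the covariance.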

Our nonparametric Bayesian model can interpolate between these
cases as appropriate for a given dataset, without changing the
model structure or hyperparameters. By analogy with
non-probabilistic methods, our model can be seen as doing
dictionary learning/sparse coding (Aharon et al., 2010) over the
latent weight vectors (albeit in an undercomplete dictionary
setting, since we assume $K \le \min\{T, D\}$). The model learns
$F$ dictionaries of basis tasks (one dictionary per group/cluster
of tasks, with $F$ inferred from the data) and tasks within each
cluster are expressed as a sparse linear combination of elements
from that dictionary. Our model can also be generalized further:
for example, by replacing the Gaussian prior on the
low-dimensional latent task representations
$s_t \in \mathbb{R}^K$ by a prior of the form
$P(s_{t+1} \mid s_t)$, one can relax the exchangeability
assumption of tasks within each group and model tasks that evolve
with time.

3.1.  Variational  inference 

As this model is infinite and combinatorial in nature, exact
inference is intractable and sampling-based inference may take
too long to converge (Doshi-Velez et al., 2009; Blei & Jordan,
2006). Hence, we employ a variational mean-field algorithm to
perform inference in this model. To do so, we lower-bound the
marginal log-probability of Y given X using a fully factored
approximating distribution $Q$ over the model parameters
$\theta, \mu, \Lambda, z, b, s$:

$$\log P(Y \mid X) = \log E_P[P(Y \mid X, \theta, \mu, \Lambda, z, b, s)]$$
$$\ge E_Q[\log P(Y \mid X)] - E_Q[\log Q(Y \mid X)].$$

We approximate the DP and the IBP with a tractable distribution
$Q$. For the DP we use a finite stick-breaking distribution,
based on the infinite stick-breaking representation of the DP
(Blei & Jordan, 2006). In this representation, we introduce, for
each $\theta_t$, a multinomial random variable $z_t$ that indexes
the infinite set of possible mixture parameters $\mu$ and
$\Lambda$. The $z_t$ vector is nonzero on its $i$-th component
with probability $\phi_i \prod_{j=1}^{i-1} (1 - \phi_j)$, where
$\phi$ is an infinite set of independent
$\mathrm{Bet}(1, \alpha_1)$ random variables (Bet is the Beta
distribution). A finite approximation to the DP is obtained by
setting a given $\phi_i$ to 1, which forces the probability of
every component $j > i$ to 0. While there is a similar
stick-breaking construction for the IBP (Teh et al., 2007), it is
not in the exponential family and requires complicated
approximations, so we represent the IBP by its finite
Beta-Bernoulli approximation (Doshi-Velez et al., 2009).

The distribution we are approximating then (for the linear
regression case) is shown in Figure 3 (top). The stick-breaking
distribution SBP, which is the prior for $z_t$, is such that
$P(z_t = i) = \phi_i \prod_{j=1}^{i-1} (1 - \phi_j)$.
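The effect of truncation is easy to check numerically. In the short sketch below, the function name `stick_breaking_weights` and the example stick proportions are illustrative assumptions; setting one proportion to 1 assigns all remaining mass to that component and zero mass to every later one.

```python
def stick_breaking_weights(phi):
    """Turn stick proportions phi_i into probabilities P(z = i)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for p in phi:
        weights.append(p * remaining)
        remaining *= 1.0 - p
    return weights


# The third proportion is 1, so the fourth component receives no mass.
phi = [0.5, 0.5, 1.0, 0.3]
weights = stick_breaking_weights(phi)
print(weights)  # [0.5, 0.25, 0.25, 0.0]
```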

Figure 3. Top: the distribution being approximated. Bottom: Our
approximating $Q$ distribution (note: $P(Y \mid \theta)$ is
lower-bounded directly).

In our variational distribution, we set the number of factor
analyzers in the truncated stick-breaking representation to a
hyperparameter $F$ and the number of factors in each such
analyzer to a truncation level hyperparameter $K$. After
inference, if the truncation levels are set high enough, most
factor analyzers (and factors within each factor analyzer) will
not be used, effectively approximating the property of the
infinite model that only a small finite number of components is
ever used to model a finite data set. It is worth noting that
while the solution found by the variational approximation is
necessarily finite, with complexity bounded by the truncation
parameters, it will still implicitly perform model selection.
Therefore, more often than not, it will concentrate most of its
posterior mass on models with less complexity than the truncation
parameters suggest. Ishwaran & James (2001) present two theorems
to help choose these truncation levels, as using smaller values
of $F$ and $K$ (particularly $K$, as the update equations are
quadratic in $K$) can lead to significant savings of computing
time (in our experiments, we simply set these to $\min\{D, T\}$,
which we found to be sufficient).

Our approximating $Q$ distribution is shown in Figure 3 (bottom).
For the linear regression case, we treat $P(Y \mid \theta)$ by
lower-bounding it directly, without introducing an approximating
distribution for Y. In the case of logistic regression, we use
the lower bound of Jaakkola & Jordan (1996), which allows us to
integrate out the logistic function.
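As an illustration of how such a bound behaves, the sketch below evaluates the quadratic lower bound on the log of the logistic function in the form in which it is commonly stated: $\log \mathrm{sig}(x) \ge \log \mathrm{sig}(\xi) + (x - \xi)/2 - \lambda(\xi)(x^2 - \xi^2)$ with $\lambda(\xi) = \tanh(\xi/2) / (4\xi)$. That this is exactly the form used in our derivation is an assumption of the example, as are the function names and the symbol $\xi$ for the variational parameter.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x)))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def jj_lambda(xi):
    # lambda(xi) = tanh(xi / 2) / (4 xi), with limit 1/8 as xi -> 0
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)


def jj_lower_bound(x, xi):
    # Quadratic-in-x lower bound on log sig(x), tight at x = +xi and x = -xi
    return (log_sigmoid(xi) + (x - xi) / 2.0
            - jj_lambda(xi) * (x * x - xi * xi))


for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    gap = log_sigmoid(x) - jj_lower_bound(x, xi=1.5)
    print(f"x = {x:5.1f}  gap = {gap:.6f}")
```

Because the bound is quadratic in $x$, its expectation under a Gaussian factor of $Q$ is available in closed form, which is what makes the remaining computations exponential-family updates.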

Apart from approximating the DP with the truncated stick-breaking
prior, approximating the IBP with a set of symmetric, finite Beta
distributed variables, and lower-bounding the logistic function
with a quadratic, all the computations involved in deriving the
variational lower bound are straightforward exponential-family
computations. Note that for $Q$ we could use more general
covariances instead of the identity matrices. In practice, we
found that this did not improve classification performance, and
it would incur a significantly higher computational cost. Another
less expensive option, however, would be to use the same
hyperparameter for each feature, i.e., a spherical (instead of
diagonal) covariance $r^2 I$, which would require optimizing
w.r.t. a single hyperparameter $r$. The variational parameter
updates are:
In the above, $\Psi$ denotes the digamma function. While it is
possible to update $\nu_{\theta_t}$ analytically, the update
requires inverting a matrix, and in our experiments this matrix
was often ill-conditioned, so we updated $\nu_{\theta_t}$ by
optimizing the lower bound with the L-BFGS-B optimizer (Zhu et
al., 1997). The optimizer is run until convergence at each
iteration, warm-started with the previous value. We note that it
could be replaced by any other optimizer, including gradient
methods, with no changes in the above equations.

The complete derivations are provided in the supplementary
material.


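For regression, the gradient of the lower bound with respect to
$\nu_{\theta_t}$ is

$$\nabla \mathcal{L}(\nu_{\theta_t}) = -\sigma \sum_f \nu_{z_t,f} \left( \nu_{\theta_t} - \nu_{\mu_f} - \nu_{\Lambda_f} (\nu_{s_t,f} \odot \nu_{b_t,f}) \right) + \sum_{i=1}^{N_t} \left( y_{t,i} x_{t,i} - x_{t,i} x_{t,i}^\top \nu_{\theta_t} \right).$$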

For classification the gradient is similar, the main difference
being that there is an extra factor in the
$x_{t,i} x_{t,i}^\top \nu_{\theta_t}$ term involving the
variational parameter for the lower bound of the logistic
function.
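For the regression case, a gradient of this kind can be checked against finite differences. The sketch below is an assumption-laden illustration rather than our implementation: it keeps only the terms of the bound that depend on $\nu_{\theta_t}$, it takes `means[f]` to be the precomputed expected mean $\nu_{\mu_f} + \nu_{\Lambda_f}(\nu_{s_t,f} \odot \nu_{b_t,f})$ of analyzer $f$, `resp[f]` to be $\nu_{z_t,f}$, and `sigma` to be the precision. The prior term pulls the weight vector toward the expected means, so it enters the gradient of the maximized bound with a negative sign.

```python
def bound_terms(theta, means, resp, sigma, X, y):
    """Terms of the regression lower bound that depend on theta."""
    prior = sum(r * sum((th - m) ** 2 for th, m in zip(theta, mean))
                for r, mean in zip(resp, means))
    fit = sum((yi - sum(a * b for a, b in zip(xi, theta))) ** 2
              for xi, yi in zip(X, y))
    return -0.5 * sigma * prior - 0.5 * fit


def bound_gradient(theta, means, resp, sigma, X, y):
    grad = [0.0] * len(theta)
    # Prior term: -sigma * sum_f resp_f * (theta - mean_f)
    for r, mean in zip(resp, means):
        for d in range(len(theta)):
            grad[d] -= sigma * r * (theta[d] - mean[d])
    # Likelihood term: sum_i (y_i x_i - x_i x_i^T theta)
    for xi, yi in zip(X, y):
        residual = yi - sum(a * b for a, b in zip(xi, theta))
        for d in range(len(theta)):
            grad[d] += residual * xi[d]
    return grad


theta = [0.5, -1.0]
means = [[0.0, 0.0], [1.0, -1.0]]
resp = [0.3, 0.7]
sigma = 2.0
X = [[1.0, 2.0], [0.5, -1.0], [-1.0, 0.3]]
y = [1.0, -0.5, 0.2]

analytic = bound_gradient(theta, means, resp, sigma, X, y)
eps = 1e-6
numeric = []
for d in range(len(theta)):
    up = list(theta)
    down = list(theta)
    up[d] += eps
    down[d] -= eps
    numeric.append((bound_terms(up, means, resp, sigma, X, y)
                    - bound_terms(down, means, resp, sigma, X, y))
                   / (2 * eps))
print(analytic, numeric)
```

Any gradient-based optimizer can consume `bound_gradient`; the finite-difference comparison is only a sanity check on the signs and indices.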

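We also optimize the lower bound w.r.t. the precision parameter
$\sigma$ to obtain an empirical Bayes estimate:

$$\frac{1}{\sigma} = \sum_t \sum_f \nu_{z_t,f} \left( \frac{\| \nu_{\theta_t} - \nu_{\mu_f} - \nu_{\Lambda_f} (\nu_{s_t,f} \odot \nu_{b_t,f}) \|^2}{KDF} + \frac{\sum_k (\cdots)}{KF} + \frac{1}{K} \right)$$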

The hyperparameters $\alpha_1$ and $\alpha_2$ are held fixed and
can be optimized by cross-validation. We initialize the inference
process with $\nu_{\theta_t}$ set to the maximum likelihood
solution to each task’s regression or classification problem.
Then we alternate between updating all other parameters to
convergence and updating $\nu_{\theta_t}$ given the other
parameters. The value of $\nu_{\theta_t}$, and hence the
regression or classification accuracy, usually stabilizes after
the first couple of iterations, and the only changes observed
afterwards are further improvements to the lower bound. This
matches behavior observed in Ando & Zhang (2005). All our
experiments were run for three iterations.

4.  Experiments 

We present results on both synthetic and real-world datasets, in
both linear regression and classification settings. As a sanity
check to show that our model can learn the underlying latent task
structure correctly, we generated a synthetic dataset consisting
of 5 clusters of tasks. Each cluster consists of 10 binary
classification tasks with 100 examples each. We used a 50/50
split for train/test data. Each task is represented by a weight
vector of length $D = 20$. Figure 4 (left) shows the true
correlation structure of the tasks and Figure 4 (right) shows the
structure recovered by our model: it infers the correct number
(5) of clusters. Our model resulted in a classification accuracy
of 83.2%, whereas independently learned tasks resulted in an
accuracy of 79.2%.

Figure 4. Left: Plot of the correlation matrix of the
ground-truth weight vectors of the 50 tasks. Right: Inferred
correlation matrix.

Our next set of experiments compares our model with a number of
baseline methods on several synthetic and real-world multitask
regression and multitask classification problems. Our baselines
include:

• Independently learned tasks - STL: assumes the tasks are
independent (no information sharing).

• Multitask Feature Learning - MTFL: assumes the tasks share a
common set of features (Argyriou et al., 2007).

• Shared Gaussian prior over the weight vectors - PRIOR (Chelba &
Acero, 2006): assumes the tasks are drawn from a shared Gaussian
prior with an unknown but fixed mean and covariance.

• Single shared subspace - RANK (Zhang et al., 2006; Rai & Daumé
III, 2010): assumes the tasks live close to a linear subspace
(also equivalent to the matrix of the weight vectors being
low-rank).

• DP mixture model based task clustering - DP-MTL (Xue et al.,
2007): assumes the weight vectors are generated from a mixture
model, each component being a full-rank Gaussian.

• Learning with Whom to Share - LWS (Kang et al., 2011): an
integer programming based method that learns the task grouping
structure (with a pre-specified number of groups) and encourages
the tasks within each group to share features.

Of these baselines, MTFL and LWS were used for regression
problems only, since the publicly available implementations are
for regression. In the experiments, we refer to our model as
MFA-MTL (Mixture of Factor Analyzers for MultiTask Learning). In
all our experiments, we set the hyperparameters $\alpha_1 = 1$
and $\alpha_2 = 5$, as these values performed reasonably in
preliminary experiments. The truncation level for the DP can be
chosen to be equal to the number of tasks $T$, and for the IBP,
to be the minimum of $T$ and the number of features $D$ in the
data. This is often more than necessary, and in most of our
experiments much smaller truncation levels were found to be
sufficient.

Table 1. Mean squared error (MSE) of various methods on multitask
regression problems

             Synthetic   School   Computer
  STL        1.35        468.7    153.3
  MTFL       0.36        376.1    30.4
  LWS        0.37        430.9    30.2
  MFA-MTL    0.18        374.5    29.8

Table 2. Multitask classification accuracies of various methods
on the Landmine and 20ng datasets

             Landmine    20ng
  STL        52.?%       69.3%
  PRIOR      52.9%       75.8%
  RANK       53.8%       75.8%
  DP-MTL     53.8%       75.7%
  MFA-MTL    62.4%       76.9%

For our multitask regression experiments, we compared MFA-MTL
with STL, MTFL, and LWS (we skip the other baselines as they
performed comparably to or worse than MTFL/LWS). For this
experiment, we used three datasets: one synthetic dataset used in
Kang et al. (2011), and two real-world datasets used commonly in
the multitask learning literature:

(1) School: This dataset consists of the examination scores of
15362 students from 139 schools in London. Each school is a task,
so there are a total of 139 tasks for this dataset. (2) Computer:
This dataset consists of a survey of 190 students about the
chances of purchasing 20 different personal computers. There are
a total of 190 tasks, 20 examples per task, and 13 features per
example. For the synthetic data, we followed a procedure for the
train/test split similar to that used by Kang et al. (2011). For
the School and Computer datasets, we split the data equally into
training and test sets and further used only 20% of the training
data (the training set was deliberately kept small, as is often
the case with multitask learning problems in practice). The mean
squared errors of each method in predicting the responses,
averaged across tasks, are shown in Table 1. As shown in Table 1,
MFA-MTL outperforms the baselines on all the datasets. Moreover,
for the synthetic data, we found that it also inferred the number
of task groups (3) correctly (the LWS baseline needs this number
to be specified; we ran it with the ground truth). On the School
and Computer datasets, MFA-MTL outperforms STL and LWS and does
slightly better than MTFL. For LWS on these two datasets, we
report the best results obtained by varying the number of groups
from 1 to 20.

We next experiment with the classification setting. For this, we
chose two datasets: (1) Landmine: The landmine detection dataset
is a subset of the dataset used in the symmetric multitask
learning experiment by Xue et al. (2007). It contains 19
classification tasks, and the tasks are known to be clustered for
this data. (2) 20ng: We did the standard training/test split of
20 Newsgroups for multitask learning, following Raina et al.
(2006), and used a 50/50 split for the landmine data. The
classification accuracies obtained by our model and the various
baselines on the landmine and 20 Newsgroups datasets are shown in
Table 2. As shown in Table 2, our method outperforms the various
baselines. We note that three of them (PRIOR, RANK, and DP-MTL),
which are methods proposed in prior work, are special cases of
our model (as discussed in Section 3). In particular, RANK
performs worse than our method, potentially because all weight
vectors share the same subspace, which may not be desirable if
not all the tasks are related to each other. DP-MTL performs
worse than our method, potentially because it fits a full-rank
Gaussian for each mixture component and is especially prone to
overfitting if the number of tasks is smaller than the number of
features.

Figure 5. Average accuracies w.r.t. varying amount of training
data (left: landmine data, right: 20ng data).

Finally, we investigated the behavior of different algorithms in
the small training data regime. For this, we varied the amount of
training examples per task (for the landmine data, we varied the
fraction from 20% to 100%; for 20 Newsgroups, we varied the
number of examples from 20 to 100). Results are shown in
Figure 5. To keep the figure uncluttered, we compare only with
STL and DP-MTL (the best performing baseline). In the small data
regime, our algorithm performs better than both STL and DP-MTL.
Another important aspect of an MTL algorithm is its asymptotic
behavior in the limit of large training data per task. For this
experiment, we compared MFA-MTL with STL on the school multitask
regression dataset by providing each algorithm with the complete
training data. MFA-MTL resulted in an MSE of 261.4, compared to
an MSE of 271.1 for STL. Therefore our algorithm tends to do
comparably to (in fact, marginally better than) independently
learned tasks even when the amount of training data per task is
sufficiently large.

5.  Related  Work 

Apart from the prior work on multitask learning discussed in
Section 1, our model is based on a motivation somewhat similar to
that of the model proposed in Argyriou et al. (2008). Their model
assumes that tasks can be partitioned into groups and that tasks
within each group share a kernel. Their assumption is an
extension of the earlier work on Multitask Feature Learning
(Argyriou et al., 2007) (one of the baselines we used in our
experiments) that assumes all tasks share a common kernel. In
Kumar & Daumé III (2012), the authors assume that there is a
single set of task basis vectors (i.e., a task dictionary) and
each task is a sparse combination of these basis vectors. In
their model, the number of basis vectors shared between two tasks
(i.e., their “overlap”) can be seen as the pairwise task
similarity. In Kang et al. (2011), the authors proposed a model
based on the assumption that the tasks exist in groups and the
tasks within each group share features, which is again similar in
spirit to our work (this model was one of our baselines in the
experiments). In contrast, the generative model we presented in
this paper offers a number of advantages over these models, such
as the ability to deal with missing data in a principled manner,
automatic control of model complexity in a nonparametric Bayesian
setting, and enough flexibility to subsume these and many other
notions of task relatedness used in multitask learning.

Among other related work, Canini et al. (2010) propose
hierarchical Dirichlet process models as good models for human
categorical learning. The idea is that one can model transfer
learning by assuming that people learn subgroups of known classes
in an unsupervised manner and use these groups to refine the
knowledge of new classes by sharing subgroups via a hierarchical
Dirichlet process. Our model can be seen as a discriminative
analog of their generative model, where aspects of the task
parameter (instead of the distribution of the test examples) are
shared among similar tasks and the sharing structure is
discovered automatically.

6.  Future  Work  and  Discussion 

We proposed and evaluated a nonparametric Bayesian multitask
learning model that usefully interpolates between many different
previously proposed models for estimating task parameters of
multiple related learning problems, such as a shared Gaussian
prior (Chelba & Acero, 2006), a clustering structure (Xue et al.,
2007), reduced dimensionality (Argyriou et al., 2007; Zhang et
al., 2006), manifold structure (Ghosn & Bengio, 2003; Agarwal et
al., 2010), etc. We presented a variational mean-field algorithm
for this model that exhibits competitive results on a set of
synthetic as well as real-world multitask learning datasets. The
proposed model, by using the flexibility afforded by
nonparametric Bayesian techniques, requires only minimal
assumptions to be applied to any given multitask learning
problem. A possible direction for future work is studying a
hierarchical Dirichlet process variant of this model where
different tasks are allowed to share exactly the same $\theta$
parameters, which might be beneficial in cases where training
data is especially sparse or the tasks are more strongly
clustered.